A novel index to aid in prioritizing habitats for site‐based conservation

Abstract Funding for biodiversity conservation strategies is usually minimal; thus, habitats at high risk should be prioritized. We developed and tested a conservation priority index (CPI) that ranks habitats to aid in prioritizing them for conservation. We tested the index using 1897 fish species from 273 African inland lakes in 34 countries. The index incorporates lake surface area, species rarity, and species' International Union for Conservation of Nature (IUCN) Red List status. We retrieved data from the Global Biodiversity Information Facility (GBIF) and IUCN data repositories. Lake Nyasa had the highest species richness (424), followed by Tanganyika (391), Nokoué (246), Victoria (216), and Ahémé (216). However, lakes Otjikoto and Guinas had the highest CPI of 137.2 and 52.1, respectively. Lakes were grouped into high priority (CPI > 0.5; n = 56) and low priority (CPI < 0.5; n = 217). The median surface area differed significantly between priority classes (W = 11,768, p < .05, effect size = 0.65). The prediction accuracies of Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for priority classes were 0.912 and 0.954, respectively. Both models identified lake surface area as the variable with the highest importance. CPI generally increased with a decrease in lake surface area. This was attributed to lower ecological substitutability and higher exposure of species to anthropogenic stressors such as pollution in smaller lakes. Also, the highest species richness per unit area was recorded for high-priority lakes. Thus, smaller habitats or lakes may be prioritized for conservation, although larger waterbodies or habitats should not be ignored. The index can be customized to local, regional, and international scales as well as to marine and terrestrial habitats.


| INTRODUCTION
Multiple human-induced stressors are exacerbating global biodiversity loss (Dobson, 1992; Dudgeon et al., 2006; Reid et al., 2019) and persistently altering the ability of ecosystems to function and provide services such as flood mitigation and food (Hooper et al., 2005; O'Connor & Crowe, 2005). These stressors, including climate change, habitat degradation, pollution, species invasions, and overexploitation, have synergistic impacts on ecosystems (Brook, 2008; Craig et al., 2017; Dudgeon et al., 2006; Hermoso et al., 2009), thereby increasing the complexity of managing and monitoring ecosystem integrity (Craig et al., 2017). New communities are created within ecosystems (Pandolfi et al., 2020); species lose their spatial insurance (Thompson et al., 2017); and ecological interactions within ecosystems are impeded, affecting species survival and ecosystem functioning and consequently increasing biodiversity loss (De Bernardi, 1981).
In addition, because freshwater ecosystems provide myriad ecosystem services to humans (McIntyre et al., 2016), they are predisposed to drastic and intermittent reclamation, increased pollution, and overexploitation (Hermoso et al., 2009). As a result, freshwater species populations have declined by 76% in the last 50 years, compared with a 39% decline for marine and terrestrial populations (WWF, 2020). Of the 29,500 freshwater species assessed by the International Union for Conservation of Nature (IUCN), 27% are threatened with extinction, and megafauna populations declined by 88% from 1970 to 2012 (Tickner et al., 2020).
Freshwater biodiversity has continued to decline (Hermoso et al., 2016), despite the proliferation of conservation management strategies such as conservation planning tools, priority indices, and protected areas (Hermoso et al., 2016). This anomaly is attributed to the inability to manage freshwater protected areas, political interference, poor sensitization, and poor delineation (Bastin et al., 2019; Hermoso et al., 2016; Holland et al., 2012). Also, ecological integrity is mostly assessed at the species level rather than for the whole ecosystem (Vié et al., 2008), although habitat loss and degradation are the major drivers of freshwater biodiversity loss (Dudgeon et al., 2006; Vié et al., 2008). Furthermore, conservation priorities are mostly inclined toward large waterbodies because of their high species richness, endemism, and numbers of threatened species (Sayer et al., 2018).
Likewise, most studies are skewed to large waterbodies (Biggs et al., 2017), despite the significance of small waterbodies as refugia for threatened species (Biggs et al., 2017;Olwa et al., 2020).
Besides species richness, ecosystem parameters such as surface area need to be incorporated in designing freshwater conservation strategies (Grzybowski & Glińska-Lewczuk, 2019). Also, competing interests in the ecosystem and costs should be considered (Dudgeon et al., 2006; Nieto et al., 2017). For effective conservation, the catchment may need to be included when delineating areas for conservation (Dudgeon et al., 2006; Nieto et al., 2017), although conserving even a small catchment carries a high cost (Dudgeon et al., 2006).
In this context, priority-based approaches are necessary to rank habitats or waterbodies that are at high risk of degradation but also practically viable for conservation (Howard et al., 2018). Elsewhere, prioritization indices have been implemented for caves in the Brazilian Atlantic Rain Forest (Souza Silva et al., 2014), forests between the Atlantic forest and Cerrado (de Mello et al., 2016), and rivers in the Mediterranean basin (Hermoso et al., 2009) to ensure preferential ecosystem selection for conservation. Freshwater ecosystems have not been widely considered, as most studies have been waterbody or habitat specific, with limited information to rank them for site-based conservation. The paucity of data on most taxa, including fishes, had in the past curtailed broader-scale distribution analysis.
However, recently, substantial amounts of data on the occurrences of different taxa have been made freely available in online repositories such as the Global Biodiversity Information Facility portal and International Union for Conservation of Nature (IUCN). This study aims to construct and test a novel conservation priority index on the fish species distribution from African lakes. We apply two model ensemble methods, eXtreme gradient boosting (XGBoost) and Random Forest (RF), to determine the most important variables in ranking the lakes for conservation. The index will aid in preferentially selecting ecosystems for site-based conservation, especially where resources are limiting.

| Study area, data acquisition, and processing
We considered fish species records from all the lakes in Africa found in the Global Biodiversity Information Facility (GBIF, 2020). We retrieved the records with the occ_download_get function in rgbif package (Chamberlain et al., 2020), and genera names were changed in conformity with FishBase nomenclature (Froese & Pauly, 2021), which is based on Van Oijen (1996). We used the coordinates to correctly reference the records that were outside the geographic range described in FishBase. For instance, Haplochromis eduardii records that were found in Lake Albert in GBIF data were moved to Lake Edward, where the species is endemic (Froese & Pauly, 2021).
Records with incomplete scientific epithets such as Oreochromis spp. and Thoracochromis spp. were discarded. Records without a lake or waterbody of origin, but with coordinates, were georeferenced using Google Earth Pro or assigned using habitat descriptions, verbatim locality, and location remarks (Figure 1).
To avoid duplication of lakes, we changed all lake nomenclature to their English names; for example, changing "lac Edouard" to Lake Edward. However, presently accepted names were maintained for lakes whose names have changed over time.
To retrieve the species' IUCN conservation status from the IUCN Red List database (www.iucnredlist.org), we used the iucn_summary and iucn_status functions in the taxize R package (Chamberlain et al., 2020). The species were classified according to the IUCN Red List for threatened species (IUCN, 2012). Species with an IUCN status of LR/nt were changed to near threatened (NT).
The data retrieved were processed through a pipeline and only 36,541 (5.16%) records were retained for analysis (Figure 1). We used a species accumulation curve to evaluate whether most of the fish species from African lakes were represented in our data in order to test the conservation priority index (Figure 3). Fish species richness was determined for each lake. Both waterbody relative rarity and total IUCN weighting were computed for each species found in a particular lake and country. The distribution of analyzed fish species records was mapped (Figure 2).
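The filtering and richness steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which was built in R); the record layout, lake names, species names, and helper functions below are hypothetical and only mirror the rules stated in the text: drop incomplete epithets, map LR/nt to NT, and count distinct species per lake.

```python
from collections import defaultdict

# Hypothetical occurrence records: (lake, scientific name, raw IUCN status).
RECORDS = [
    ("Lake A", "Examplus alpha", "LC"),
    ("Lake A", "Examplus alpha", "LC"),    # repeated occurrence of one species
    ("Lake A", "Oreochromis spp.", "NE"),  # incomplete epithet, discarded
    ("Lake B", "Examplus beta", "LR/nt"),  # legacy code, mapped to NT
    ("Lake B", "Examplus gamma", "CR"),
]


def has_complete_epithet(name):
    """True when the name has a genus and an epithet other than sp./spp."""
    parts = name.split()
    return len(parts) >= 2 and parts[1].rstrip(".").lower() not in {"sp", "spp"}


def normalise_status(status):
    """Map the legacy LR/nt code to NT; leave other codes unchanged."""
    return "NT" if status == "LR/nt" else status


def species_by_lake(rows):
    """Return {lake: {species: status}}, one entry per species per lake."""
    lakes = defaultdict(dict)
    for lake, species, status in rows:
        if has_complete_epithet(species):
            lakes[lake][species] = normalise_status(status)
    return dict(lakes)


lakes = species_by_lake(RECORDS)
# Species richness is the number of distinct species retained for each lake.
richness = {lake: len(species) for lake, species in lakes.items()}
print(richness)  # {'Lake A': 1, 'Lake B': 2}
```

Keeping species in a per-lake dictionary means repeated occurrence records do not inflate richness.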

| Evaluating species relative rarity among waterbodies or habitats
Conservation strategies, laws, policies, and binding targets are mostly drawn up under regional and international conventions such as the Convention on Biological Diversity (CBD), the Ramsar Convention, and the Convention on International Trade in Endangered Species of Wild Fauna and Flora. However, national authorities or parties are vital in implementing or enforcing the treaties, laws, and policies such as CBD (1992). Thus, the species relative rarity, and consequently the conservation priority index, were computed at a national level.
We computed the species relative rarity (SRR) of each species from the ratio of the number of waterbodies where the species was found in a particular country to the total number of correctly georeferenced waterbodies with fish species data in that country (Equation 1). We introduced an arbitrary value of 1 so that the measure is highest for the rarest species and approaches 0 if a species is found in most of the waterbodies (Equation 1). We assumed that a species found in many waterbodies has a greater area of occupancy and is thus not threatened by a single stressor. In IUCN, such a species is listed as least concern (IUCN, 2012). Furthermore, a species that is threatened but found in many waterbodies or habitats would have a low relative rarity compared with a least concern species found in only one lake, waterbody, or habitat. The relative rarity weight for a waterbody was computed as the sum of rarity weights for the species in that waterbody scaled to the total number of species in the same waterbody (Equation 2).
In Equation 1, Ws is the total number of lakes where the species was observed in a particular country, and Wt is the total number of correctly referenced lakes obtained in that country. The conservation priority index for a waterbody will be 0 if only one waterbody is considered in a particular country. Relative rarity for each waterbody is the summation of the relative rarity for each species found in that waterbody (Equation 2).
In the study, for each correctly referenced waterbody, we collated its surface area from literature (Burgis & Symoens, 1987; Ogutu-Ohwayo et al., 1999; Olowo et al., 2004; Schofield & Chapman, 1999; Vanden Bossche & Bernacsek, 1990
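Because Equations 1 and 2 did not survive in this text, the following Python sketch works from the verbal definitions only. It assumes SRR = 1 - Ws/Wt, which matches the stated behaviour (highest for the rarest species, approaching 0 for widespread species, and 0 when a country has a single waterbody). For the waterbody value it uses the plain sum of species SRR; dividing by the number of species is the other possible reading of "scaled to the total number of species". Function names and the example lakes are hypothetical.

```python
def species_relative_rarity(lakes_with_species, total_lakes):
    """Assumed form of Equation 1: SRR = 1 - Ws / Wt."""
    if total_lakes <= 0 or not 0 < lakes_with_species <= total_lakes:
        raise ValueError("require 0 < Ws <= Wt")
    return 1 - lakes_with_species / total_lakes


def waterbody_relative_rarity(country_lakes):
    """Sum species SRR per lake for one country.

    country_lakes maps each lake name to the set of species recorded there.
    """
    total_lakes = len(country_lakes)
    # Ws: number of lakes in the country where each species occurs.
    occupancy = {}
    for species_set in country_lakes.values():
        for species in species_set:
            occupancy[species] = occupancy.get(species, 0) + 1
    return {
        lake: sum(
            species_relative_rarity(occupancy[s], total_lakes)
            for s in species_set
        )
        for lake, species_set in country_lakes.items()
    }


# Hypothetical country with four lakes; species "x" occurs everywhere.
example = {
    "Lake A": {"x", "y"},
    "Lake B": {"x"},
    "Lake C": {"x"},
    "Lake D": {"x", "z"},
}
rarity = waterbody_relative_rarity(example)
print(rarity)
```

In this example the widespread species contributes 0 to every lake, while each single-lake species contributes 0.75, so only lakes A and D receive a non-zero rarity.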

| Incorporating species' IUCN Red List status in the conservation priority index
The IUCN Red List for threatened species is the most comprehensive and detailed database evaluating species' threat levels worldwide (Vié et al., 2008). The species evaluations are conducted at national, regional, or international levels to inform policies, laws, and targets (Vié et al., 2008). Most indices or conservation strategies are designed to cater for threatened species, while excluding data deficient, least concern, near threatened, and not evaluated species. This omission leaves not evaluated (NE) and data deficient (DD) species in particular at risk of going extinct unnoticed. Thus, IUCN suggested that DD and NE species should be considered as critically endangered until their status is known (IUCN, 2012). For least concern species, managers have put less effort into monitoring trends in their stock sizes. For example, species such as Labeo victorianus and Oreochromis esculentus, which once dominated the commercial stocks in Lake Victoria (Cadwalladr, 1965), are now critically endangered (FishBase team RMCA & Geelhand, 2016). Thus, in this index, all species were considered and weights were given based on the threat level, with the highest and lowest weights assigned to the extinct and least concern categories, respectively. The weights were assigned as follows: ET = 7, EXw = 6, CR = 5, DD = 5, NE = 5, EN = 4, VU = 3, NT = 2, and LC = 1. We computed a conservation score (Cwt) as the product of the total number of species in a given IUCN Red List category and the weight assigned to that threat category, summed across all the IUCN Red List categories (Equation 3).
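The conservation score can be sketched in Python directly from the description above: count the species in each category, multiply by the category weight, and sum. The weights are those listed in the text; the function name and the example species list are hypothetical.

```python
from collections import Counter

# Category weights as listed in the text (codes kept as written there).
IUCN_WEIGHTS = {
    "ET": 7, "EXw": 6, "CR": 5, "DD": 5, "NE": 5,
    "EN": 4, "VU": 3, "NT": 2, "LC": 1,
}


def conservation_score(statuses):
    """Cwt: sum over categories of (species count x category weight).

    statuses holds one IUCN code per species in the waterbody. An unknown
    code raises KeyError rather than being silently ignored.
    """
    counts = Counter(statuses)
    return sum(IUCN_WEIGHTS[code] * n for code, n in counts.items())


# Hypothetical lake with five species.
score = conservation_score(["LC", "LC", "CR", "NE", "VU"])
print(score)  # 2*1 + 5 + 5 + 3 = 15
```

Note that a not evaluated species contributes as much as a critically endangered one, which is the precautionary treatment described in the text.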
The conservation priority index (CPI) was then formulated as the product of the conservation score (IUCN total weights) and relative rarity for each species, summed across all of the species within a lake, divided by the area of the lake and a scaling constant to account for the number of categories (Equation 4). Here, we used lake area as a penalty based on several assumptions: (1) the cost of conservation is generally higher for large lakes than for small lakes; (2) biodiversity generally has less room for adaptation to stressors in small waterbodies than in large waterbodies; and (3) conservation actions are likely to be more effective in small lakes than in large waterbodies. These assumptions aimed to control for the size of the waterbody so that the index is not necessarily higher for larger lakes.
In Equation 4, for each waterbody, Aw is the total surface area of the waterbody, and Cwt and RRw are the IUCN total weights and relative rarity for that waterbody. The value of 8 was a scaling constant indicating the total number of IUCN categories considered. Thus, the scaling constant depends on the IUCN categories the user considers in the index.
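Equation 4 is not reproduced in this text, so the sketch below assumes one reading of the verbal definition: CPI = (Cwt x RRw) / (k x Aw), with k = 8 as the scaling constant. Both the functional form and the function name are assumptions for illustration, not the authors' published formula. The scaling constant is left as a parameter because the text says it depends on the categories a user includes.

```python
def conservation_priority_index(cwt, rrw, area_km2, n_categories=8):
    """Assumed form of Equation 4: (Cwt * RRw) / (n_categories * Aw).

    cwt          conservation score (IUCN total weights) for the waterbody
    rrw          relative rarity for the waterbody
    area_km2     surface area of the waterbody, used as a penalty
    n_categories scaling constant (8 in the text)
    """
    if area_km2 <= 0:
        raise ValueError("surface area must be positive")
    if n_categories <= 0:
        raise ValueError("scaling constant must be positive")
    return (cwt * rrw) / (n_categories * area_km2)


# Same hypothetical biota placed in a small and a large lake.
small = conservation_priority_index(cwt=15, rrw=0.75, area_km2=0.5)
large = conservation_priority_index(cwt=15, rrw=0.75, area_km2=500)
print(small, large)
```

Under this reading, identical scores and rarity give a much higher index in the small lake, and a waterbody with zero relative rarity (for instance the only lake assessed in a country) has an index of zero.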

| Significance of the variables on the performance of the index
We used Pearson correlation to determine the relationship among species richness, IUCN total weighting, surface area, and waterbody relative rarity. To determine the relative importance of the variables in the conservation priority index (CPI), we arbitrarily classified the waterbodies as high priority (CPI > 0.5, n = 56) and low priority (CPI < 0.5, n = 217). We compared two ensemble machine learning algorithms, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), to develop classification models for the priority classes (high and low). We converted the categorical variable (country where the lakes were found) into dummy variables (numerical codes) using one-hot encoding, and included this variable in the models to account for national variations in species composition and conservation status. The pre-processed data were randomly partitioned into training (70%) and testing (30%) sets. The Random Forest (RF) training model was tuned with the tuneRF function in the randomForest package with a step factor of 2, and an improve rate of (Liaw & Wiener, 2002). Partial dependence plots were used to indicate how the index varied with changes in the index parameters, namely conservation score (IUCN total weights), waterbody relative rarity, country, and surface area. For XGBoost, parameters such as mlogloss (evaluation metric) and softprob (objective) were included in the watch list prior to constructing the best model. Model performance was measured using recall (sensitivity), specificity, precision, F1-score, and accuracy (Yokoyama & Yamaguchi, 2020).
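The class labels and the performance metrics named above can be illustrated with a short, dependency-free Python sketch. It does not reproduce the RF or XGBoost models; the CPI values and predictions are hypothetical, and treating a CPI of exactly 0.5 as low priority is an assumption, since the text defines only CPI > 0.5 and CPI < 0.5.

```python
def priority_class(cpi, threshold=0.5):
    """Label a waterbody; CPI equal to the threshold is treated as low."""
    return "high" if cpi > threshold else "low"


def classification_metrics(actual, predicted, positive="high"):
    """Recall, specificity, precision, F1, and accuracy for two classes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(actual)),
    }


# Hypothetical CPI values and hypothetical model predictions.
cpi_values = [2.0, 0.9, 0.3, 0.0, 0.1, 4.5]
actual = [priority_class(v) for v in cpi_values]
predicted = ["high", "low", "low", "low", "high", "high"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here two high-priority lakes are recovered, one is missed, and one low-priority lake is misclassified as high, so every metric equals 2/3.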
For both models, a variable importance plot was produced. We used both XGBoost (tree combinations at the start) and RF (independent tree building) to compare prediction accuracy and identify the best algorithm to classify the data. Both algorithms are robust to variance in the input data and can handle overfitting through hyperparameter tuning. We used the Wilcoxon signed-rank test to compare the differences in the median surface area, IUCN total weights, and the waterbody relative rarity for the two priority classes (high and low). The effect size was computed with the wilcox_effsize function from the rstatix package.
Lake Nyasa had the highest species richness (424), followed by Tanganyika (391), Nokoué (246), Victoria (216), and Ahémé (216). In lakes Gashanga, Kingiri, and Saka, only one fish species was observed in the assessed GBIF data. Of 1897 species, 1269 (66.9%) were least concern, 209 (11.0%) not evaluated, 160 (8.4%) data deficient, 81 (4.1%) vulnerable, 92 (4.8%) critically endangered, 45 (2.4%) endangered, 39 (2.1%) near threatened, and 2 (0.1%) were extinct.

| Conservation priority index, classification of lakes, and model predictions
Lakes Otjikoto and Guinas in Namibia had the highest conservation priority index of 137.5 and 52.1, respectively, followed by Lake Nkuruba in Uganda at 34.1 (Appendix 1). The CPI for lakes Ngami, Piso, and Faguibine was zero (Appendix 1). After classifying the lakes into two conservation priority classes, 56 of the 273 lakes (20.5%) were in the high priority class and 217 (79.5%) in the low priority class (Appendix 1). Uganda had the highest number of high-priority lakes (14), followed by Cameroon, Madagascar, and South Africa (6) (Appendix 1). The median surface area (km²) for low priority class (64) Figure 5). When the surface area of the lake was small, the model predicted a high priority class (Figure 5a). In contrast to surface area, an increase in both IUCN total weights and waterbody relative rarity led the RF model to predict a high priority class for the lakes (Figure 5b,c). The model prediction for priority classes versus the 34 country codes was not presented in plots.
In both models, surface area had the highest variable importance (19.2% mean decrease in accuracy and 29.3% mean decrease in Gini for RF; 72.7% gain for XGBoost) (Figures 6 and 7). In RF, surface area was followed by IUCN total weights (6.44%) and rarity (4.4%), with country codes ranking lowest. Similarly, in XGBoost, surface area was followed by IUCN total weights (20.8%), waterbody relative rarity, and country codes. In both models, the country variable was converted into dummy codes and only three country codes were significant and displayed in the plot (Figures 6 and 7).
(Eschmeyer, 2005). The species accumulation curve approached an asymptote, which suggested that most of the species were considered in the experimentation of the index (Gotelli & Colwell, 2001 (Rosenzweig, 1995). Differences in species richness and relative rarity among waterbodies are attributed to geomorphological, abiotic, and biotic factors (Brown et al., 2007).

Also, isolated inland waterbodies favor rapid allopatric speciation and adaptive radiation because the species are exposed to different evolutionary pressures (Basiita et al., 2018). Lakes Nyasa, Victoria, and Tanganyika are endowed with diverse haplochromine cichlids, which has been attributed to the rapid speciation rates associated with habitat heterogeneity and disruptive sexual selection (Salzburger et al., 2005; Seehausen, 2000). The haplochromine lineages are endemic to Lake Tanganyika (Salzburger et al., 2005).
The geomorphological barriers, for example, the Murchison Falls along Victoria Nile hindered fish migration to Lake Kyoga from Albert (Basiita et al., 2018). The falls along River Semiliki prevent fish passage to Lake Albert from Edward (Acere & Mwene-Beyanga, 1990). Similarly, Lake Victoria and Kyoga were previously separated by the Owen and Bujagali falls (Basiita et al., 2018), and a sandbar separated Lake Nabugabo and Victoria (Stager et al., 2005). These biogeographical barriers may have led to allopatric speciation.
The high species richness of Lake Nokoué could be attributed to its high habitat and seasonal variability; for example, it includes estuarine, freshwater, and marine habitats (Lalèyè et al., 2003). The lake is connected to the Atlantic Ocean, Porto-Novo lagoon (30 km²), and the Ouémé Delta (Lalèyè et al., 2003). These habitats support a diversity of species within the lake. However, Lalèyè et al. (2003) identified only 51 species from Lake Nokoué while studying its spatial and seasonal ichthyofaunal distribution. GBIF holds data from different sources, and thus the 216 species obtained from the data could be attributed to different data sources.
Lake Nokoué, similar to lakes Nyasa, Victoria, and Tanganyika, is threatened by anthropogenic pressure due to intense settlement around it. According to the Global Nature Fund (2019)

| Conservation priority index and model predictions
Lakes Guinas and Otjikoto had the highest conservation priority index. The lakes harbor Tilapia guinasana, a critically endangered species introduced from Lake Guinas to Lake Otjikoto (Skelton, 1978).
Generally, the restricted size of the lakes increases the vulnerability of their biota to sudden changes in ecological pressures such as alien species invasion and habitat degradation (Irish, 1991; Skelton, 1978). For example, T. guinasana was introduced to Lake Otjikoto from Guinas and invaded the niches of populations of Pseudocrenilabrus philander (Irish, 1991). Both lakes are threatened by habitat degradation and pollution because they are surrounded by agricultural fields (Irish, 1991). Lake Otjikoto was declared a national monument in Namibia, and the only underwater museum in the world because ammunitions were dumped in the lake after
Although lakes Nyasa, Victoria, Tanganyika, and Nokoué had the highest species richness, their CPIs were low. The index considered surface area as a proxy for the cost of implementing a biodiversity strategy on the lake. According to Article 6(a) of the Convention on Biological Diversity, each contracting party shall "develop national strategies, plans or programmes for the conservation and sustainable use of biological diversity or adapt for this purpose existing strategies, plans or programmes which shall reflect, inter alia, the measures set out in this Convention relevant to the Contracting Party concerned." Therefore, conserving Lake Tanganyika, which is shared by four countries, would involve the cost of harmonizing national strategies. Also, because the surface area is large, varied habitats are available for a species to seek refugia if its native habitat or niche is affected or invaded by a predator.
For example, in Lake Victoria, rock-dwelling Paralabidochromis species were not highly affected by the invasion of Lates niloticus (Balirwa et al., 2003). In small lakes or habitats, species are highly exposed to ecological threats because ecological variability is lower. Also, conserving small habitats costs less, and local community conservation approaches can easily be applied. The CPI of a lake was zero if it was the only lake or habitat assessed in a particular country, because no comparisons could be made for priority selection.

| Classification and prediction of habitat priority
Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) had similar model prediction accuracy, sensitivity (recall), F1 score, precision, and specificity for both training and testing data. After classification of the lakes into high and low priority, the partial dependence plots from RF models showed that high priority classes were predicted at smaller lake surface areas. The larger the surface area, the lower the priority for conservation. Partial dependence plots are vital in determining the relationship between the variables and the predicted probabilities of the classes (Cutler et al., 2007).
Surface area was ranked as the variable with the highest importance in the index by both models. Although species rarity, richness, and threat status are vital in biodiversity conservation science, the success of managing a habitat will depend on the costs required to implement the strategies. Biodiversity funding, in developing countries and worldwide, is still minimal (Bayon et al., 2000), despite the 33 trillion US dollars generated on average from ecosystem services annually (Costanza et al., 1997). The IUCN key biodiversity areas (KBA) are sites vital to protecting global biodiversity and depend on key trigger species (IUCN, 2016). However, priority areas for conservation such as protected areas can lie beyond key biodiversity areas (IUCN, 2016).
Similar to this study, IUCN (2016) noted that for conservation priority areas, the cost, connectivity, and evolutionary history should be considered.

| CONCLUSION
The aim of this study was to construct a novel conservation priority index to aid in the selection of habitats for site-based conservation. The index was designed and tested on 1897 fish species from 273 African inland lakes in 34 countries. We applied two model ensemble methods, eXtreme gradient boosting (XGBoost) and Random Forest (RF), to determine the most important variables in ranking the lakes for conservation. Results showed that lake surface area was the most important variable for ranking habitats for site-based conservation.
While species richness is generally higher for large lakes than for small ones, this study suggests that smaller waterbodies need to be prioritized for conservation because of their low habitat heterogeneity, low ecological substitutability for the species, and higher levels of exposure to human-induced threats compared to large systems. In large systems with vast habitat heterogeneity, fish species can easily seek refugia in other habitats. This index can be applied at local, national, and regional scales for other taxa, and can aid in preferentially selecting ecosystems for site-based conservation, especially where resources are limiting. This index can be greatly affected by incorrect identification of the species. Also, species need to be correctly geolocated to avoid over- or underweighting/ranking of the waterbody, habitat, or any ecosystem.

ACKNOWLEDGMENTS
The authors wish to thank all the teams that contributed data in the Global Biodiversity Information Facility (GBIF) and International Union for Conservation of Nature (IUCN) that were freely available and used in the entire manuscript. We thank Mugeni Bairon for designing the map of records.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data can be accessed on Dryad (DOI https://doi.org/10.5061/dryad.4b8gthtcx).